AITopics | adaptive sgd

Adaptive SGD with Polyak stepsize and Line-search: Robust Convergence and Variance Reduction

Neural Information Processing Systems | Dec-25-2025, 07:12:20 GMT

The recently proposed stochastic Polyak stepsize (SPS) and stochastic line-search (SLS) for SGD have shown remarkable effectiveness when training over-parameterized models. However, two issues remain unsolved in this line of work. First, in non-interpolation settings, both algorithms only guarantee convergence to a neighborhood of a solution which may result in a worse output than the initial guess. While artificially decreasing the adaptive stepsize has been proposed to address this issue (Orvieto et al.), this approach results in slower convergence rates under interpolation. Second, intuitive line-search methods equipped with variance-reduction (VR) fail to converge (Dubois-Taine et al.).
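
For orientation, the baseline stochastic Polyak stepsize that the abstract builds on can be sketched as follows. This is not the paper's proposed method. It assumes the commonly cited capped form, gamma = min((f_i(x) - f_i*) / (c * ||grad f_i(x)||^2), gamma_max). The constants `c` and `gamma_max`, the choice f_i* = 0, and the toy least-squares data are illustrative assumptions, not values from the listing.

```python
import random

def sps_step(loss_i, grad_i, loss_i_min=0.0, c=0.5, gamma_max=1.0):
    """Capped stochastic Polyak stepsize for one sampled loss f_i (assumed form)."""
    grad_sq = sum(g * g for g in grad_i)
    if grad_sq == 0.0:
        return 0.0  # sampled loss is already stationary, so take no step
    return min((loss_i - loss_i_min) / (c * grad_sq), gamma_max)

def sgd_sps(data, x0, steps=200, seed=0):
    """SGD on f_i(x) = 0.5 * (a_i . x - b_i)^2 using the stepsize above."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        a, b = rng.choice(data)
        residual = sum(ai * xi for ai, xi in zip(a, x)) - b
        loss = 0.5 * residual * residual
        grad = [residual * ai for ai in a]
        gamma = sps_step(loss, grad)
        x = [xi - gamma * gi for xi, gi in zip(x, grad)]
    return x

# Hypothetical consistent system (an interpolation setting): every f_i is zero at (2, -1).
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
x = sgd_sps(data, (0.0, 0.0))
```

Because every sampled loss here can be driven to zero at the same point, the stepsize needs no tuning. The non-interpolation case described in the abstract is the one where this baseline only reaches a neighborhood of a solution.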

polyak stepsize, polyak stepsize and line-search, robust convergence and variance reduction, (5 more...)

Technology: Information Technology > Artificial Intelligence (0.62)

STORM+: Fully Adaptive SGD with Recursive Momentum for Nonconvex Optimization

Neural Information Processing Systems | Dec-24-2025, 17:13:10 GMT

In this work we investigate stochastic non-convex optimization problems where the objective is an expectation over smooth loss functions, and the goal is to find an approximate stationary point. The most popular approach to handling such problems is variance reduction techniques, which are also known to obtain tight convergence rates, matching the lower bounds in this case. Nevertheless, these techniques require careful maintenance of anchor points in conjunction with appropriately selected "mega-batchsizes". This leads to a challenging hyperparameter tuning problem that weakens their practicality. Recently, [Cutkosky and Orabona, 2019] have shown that one can employ recursive momentum in order to avoid the use of anchor points and large batchsizes, and still obtain the optimal rate for this setting. Yet, their method, called $\rm{STORM}$, crucially relies on the knowledge of the smoothness, as well as a bound on the gradient norms. In this work we propose $\rm{STORM}^{+}$, a new method that is completely parameter-free, does not require large batch-sizes, and obtains the optimal $O(1/T^{1/3})$ rate for finding an approximate stationary point. Our work builds on the $\rm{STORM}$ algorithm, in conjunction with a novel approach to adaptively set the learning rate and momentum parameters.
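
The recursive momentum idea mentioned above can be illustrated with a minimal sketch. It assumes the usual form of the recursion, d_t = g(x_t, xi_t) + (1 - a)(d_{t-1} - g(x_{t-1}, xi_t)), where the same sample xi_t is evaluated at both iterates. The fixed `eta` and `a`, the function name `storm_like`, and the toy quadratic are illustrative assumptions. STORM+ itself sets the learning rate and momentum adaptively, and that schedule is not reproduced here.

```python
import random

def storm_like(grad, x0, sample, steps=500, eta=0.1, a=0.1):
    """Recursive-momentum SGD with constant eta and a (illustrative constants only)."""
    x_prev = x0
    xi = sample()
    d = grad(x_prev, xi)          # first estimate is a plain stochastic gradient
    x = x_prev - eta * d
    for _ in range(steps - 1):
        xi = sample()
        # The same sample xi is evaluated at the current and the previous iterate,
        # so the correction term cancels much of the sampling noise.
        d = grad(x, xi) + (1.0 - a) * (d - grad(x_prev, xi))
        x_prev, x = x, x - eta * d
    return x

# Hypothetical problem: f(x, xi) = 0.5 * (x - xi)^2 with xi drawn near mu = 3.
rng = random.Random(0)
mu = 3.0
x_final = storm_like(lambda x, xi: x - xi, 0.0, lambda: mu + rng.uniform(-0.1, 0.1))
```

No anchor point or large batch appears anywhere in the loop: the only extra cost over plain SGD is a second gradient evaluation per step at the previous iterate.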

adaptive sgd, name change, recursive momentum, (7 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.40)

Distributed Stochastic Optimization via Adaptive SGD

Neural Information Processing Systems | Nov-20-2025, 22:17:14 GMT

Stochastic convex optimization algorithms are the most popular way to train machine learning models on large-scale data. Scaling up the training process of these models is crucial, but the most popular algorithm, Stochastic Gradient Descent (SGD), is a serial method that is surprisingly hard to parallelize. In this paper, we propose an efficient distributed stochastic optimization method by combining adaptivity with variance reduction techniques. Our analysis yields a linear speedup in the number of machines, constant memory footprint, and only a logarithmic number of communication rounds. Critically, our approach is a black-box reduction that parallelizes any serial online learning algorithm, streamlining prior analysis and allowing us to leverage the significant progress that has been made in designing adaptive algorithms. In particular, we achieve optimal convergence rates without any prior knowledge of smoothness parameters, yielding a more robust algorithm that reduces the need for hyperparameter tuning. We implement our algorithm in the Spark distributed framework and exhibit dramatic performance gains on large-scale logistic regression problems.

algorithm, name change, stochastic optimization, (4 more...)

Country: North America > United States (0.08)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.61)

Adaptive SGD with Polyak stepsize and Line-search: Robust Convergence and Variance Reduction

Neural Information Processing Systems | May-26-2025, 22:33:09 GMT

The recently proposed stochastic Polyak stepsize (SPS) and stochastic line-search (SLS) for SGD have shown remarkable effectiveness when training over-parameterized models. However, two issues remain unsolved in this line of work. First, in non-interpolation settings, both algorithms only guarantee convergence to a neighborhood of a solution which may result in a worse output than the initial guess. While artificially decreasing the adaptive stepsize has been proposed to address this issue (Orvieto et al.), this approach results in slower convergence rates under interpolation. Second, intuitive line-search methods equipped with variance-reduction (VR) fail to converge (Dubois-Taine et al.).

polyak stepsize, polyak stepsize and line-search, robust convergence and variance reduction, (3 more...)

Technology: Information Technology > Artificial Intelligence (0.65)

STORM+: Fully Adaptive SGD with Recursive Momentum for Nonconvex Optimization

Neural Information Processing Systems | Jan-18-2025, 15:26:27 GMT

In this work we investigate stochastic non-convex optimization problems where the objective is an expectation over smooth loss functions, and the goal is to find an approximate stationary point. The most popular approach to handling such problems is variance reduction techniques, which are also known to obtain tight convergence rates, matching the lower bounds in this case. Nevertheless, these techniques require careful maintenance of anchor points in conjunction with appropriately selected "mega-batchsizes". This leads to a challenging hyperparameter tuning problem that weakens their practicality. Recently, [Cutkosky and Orabona, 2019] have shown that one can employ recursive momentum in order to avoid the use of anchor points and large batchsizes, and still obtain the optimal rate for this setting.

adaptive sgd, nonconvex optimization, recursive momentum, (4 more...)

Technology: Information Technology > Artificial Intelligence (0.61)

Adaptive SGD with Polyak stepsize and Line-search: Robust Convergence and Variance Reduction

Neural Information Processing Systems | Jan-18-2025, 06:40:57 GMT

The recently proposed stochastic Polyak stepsize (SPS) and stochastic line-search (SLS) for SGD have shown remarkable effectiveness when training over-parameterized models. However, two issues remain unsolved in this line of work. First, in non-interpolation settings, both algorithms only guarantee convergence to a neighborhood of a solution which may result in a worse output than the initial guess. While artificially decreasing the adaptive stepsize has been proposed to address this issue (Orvieto et al.), this approach results in slower convergence rates under interpolation. Second, intuitive line-search methods equipped with variance-reduction (VR) fail to converge (Dubois-Taine et al.).

polyak stepsize, polyak stepsize and line-search, robust convergence and variance reduction, (3 more...)

Technology: Information Technology > Artificial Intelligence (0.65)

Reviews: Distributed Stochastic Optimization via Adaptive SGD

Neural Information Processing Systems | Oct-7-2024, 10:57:34 GMT

Update: I keep my initial rating. As a potential improvement for the paper, I see the authors only tackle the non-strictly convex case. I am curious how the result would be modified if the authors assumed strong convexity. Original review: In this paper the authors introduce SVRG OL, a distributed stochastic optimization method for convex optimization. Inspired by SVRG, the authors first compute a high-precision estimate of a gradient at an anchor point v using a large number of samples.
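
The anchor-point step the reviewer describes follows the standard SVRG pattern, which can be sketched as below. This shows plain SVRG's gradient estimate only, not the SVRG OL algorithm under review, whose details are cut off in this excerpt. The helper names and the toy quadratic losses are hypothetical.

```python
def anchor_gradient(grads, v):
    """Accurate gradient estimate at the anchor v: average over all per-sample gradients."""
    return sum(g(v) for g in grads) / len(grads)

def svrg_gradient(grad_i, w, v, anchor_grad):
    """SVRG-style estimate: a single-sample gradient at w, corrected using the anchor v."""
    return grad_i(w) - grad_i(v) + anchor_grad

# Hypothetical losses f_i(w) = 0.5 * (w - b_i)^2, so grad f_i(w) = w - b_i.
targets = [1.0, 2.0, 6.0]
grads = [lambda w, b=b: w - b for b in targets]

v = 0.0                              # anchor point
mu_v = anchor_gradient(grads, v)     # full gradient at the anchor
w = 3.0                              # minimizer of the average loss
plain = [g(w) for g in grads]
corrected = [svrg_gradient(g, w, v, mu_v) for g in grads]
```

In this toy case the plain single-sample gradients at the minimizer are nonzero and disagree with each other, while every corrected estimate equals the true full gradient, which is zero there.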

accuracy, computation, stochastic optimization, (13 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.61)

A High Probability Analysis of Adaptive SGD with Momentum

Li, Xiaoyu, Orabona, Francesco

arXiv.org Machine Learning | Jul-28-2020

Stochastic Gradient Descent (SGD) and its variants are the most used algorithms in machine learning applications. In particular, SGD with adaptive learning rates and momentum is the industry standard to train deep networks. Despite the enormous success of these methods, our theoretical understanding of these variants in the nonconvex setting is not complete, with most of the results only proving convergence in expectation and with strong assumptions on the stochastic gradients. In this paper, we present a high probability analysis for adaptive and momentum algorithms, under weak assumptions on the function, stochastic gradients, and learning rates. We use it to prove for the first time the convergence of the gradients to zero in high probability in the smooth nonconvex setting for Delayed AdaGrad with momentum.

artificial intelligence, machine learning, probability, (15 more...)

arXiv:2007.14294

Country: Europe > Austria > Vienna (0.14), North America > United States (0.04), Europe > Russia (0.04), (2 more...)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.96)

Distributed Stochastic Optimization via Adaptive SGD

Cutkosky, Ashok, Busa-Fekete, Róbert

Neural Information Processing Systems | Feb-14-2020, 09:12:19 GMT

Stochastic convex optimization algorithms are the most popular way to train machine learning models on large-scale data. Scaling up the training process of these models is crucial, but the most popular algorithm, Stochastic Gradient Descent (SGD), is a serial method that is surprisingly hard to parallelize. In this paper, we propose an efficient distributed stochastic optimization method by combining adaptivity with variance reduction techniques. Our analysis yields a linear speedup in the number of machines, constant memory footprint, and only a logarithmic number of communication rounds. Critically, our approach is a black-box reduction that parallelizes any serial online learning algorithm, streamlining prior analysis and allowing us to leverage the significant progress that has been made in designing adaptive algorithms.

adaptive sgd, algorithm, stochastic optimization, (1 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.64)